Abstract

In distributed learning, the goal is to perform a learning task over data distributed across multiple
nodes with minimal (expensive) communication. Prior work (Daumé III et al., 2012) proposes a general
model that bounds the communication required for learning classifiers while allowing for $\varepsilon$ training error
on linearly separable data adversarially distributed across nodes.

In this work, we develop key improvements and extensions to this basic model. Our first result is a
two-party multiplicative-weight-update based protocol that uses $O(d^2 \log 1/\varepsilon)$ words of communication
to classify distributed data in arbitrary dimension $d$, $\varepsilon$-optimally. This readily extends to classification
over $k$ nodes with $O(kd^2 \log 1/\varepsilon)$ words of communication. Our proposed protocol is simple to implement
and is considerably more efficient than the baselines compared, as demonstrated by our empirical results.

In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed
data. We show how to solve fixed-dimensional and high dimensional linear programming efficiently
in a distributed setting where constraints may be distributed across nodes. Since many learning
problems can be viewed as convex optimization problems where constraints are generated by individual
points, this models many typical distributed learning scenarios. Our techniques make use of a novel
connection from multipass streaming, as well as adapting the multiplicative-weight-update framework
more generally to a distributed setting. As a consequence, our methods extend to the wide range of
problems solvable using these techniques.

1 Introduction 

In recent years, distributed learning (learning from data spread across multiple locations) has witnessed a
lot of research interest (Bekkerman et al., 2011). One of the major challenges in distributed learning is to
minimize communication overhead between different parties, each possessing a disjoint subset of the data.



Recent work (Daumé III et al., 2012) has proposed a distributed learning model that seeks to minimize
communication by carefully choosing the most informative data points at each node. The authors present a
number of general sampling based results as well as a specific two-way protocol that provides a logarithmic
bound on communication for the family of linear classifiers in $\mathbb{R}^2$. Most of their results pertain to two
players but they propose basic extensions for multi-player scenarios. A distinguishing feature of this model
is that it is adversarial. Except linear separability, no distributional or other assumptions are made on the
data or how it is distributed across nodes.

In this paper, we develop this model in two substantial ways. First, we extend the results on linear
classification to arbitrary dimensions, in the process presenting a more general algorithm that does not rely on
explicit geometric constructions. This approach exploits the multiplicative weight update (MWU)
framework (specifically its use in boosting) and retains desirable theoretical guarantees (data-size-independent
communication between nodes in order to classify data) while being simple to implement. Moreover, it
easily extends to $k$ players with an additional factor of $k$ in communication over the two-player result, which improves
the earlier results in two dimensions by a factor of $k$. A second contribution of this work is to demonstrate
how general convex optimization problems (for example, linear programming, SDPs and the like) can be
solved efficiently in this distributed framework using ideas from both multipass streaming and the
well-known multiplicative weight update method. Since many (batch) learning tasks can be reduced to
convex optimization problems, this second contribution opens the door to deploying many other learning
tasks in the distributed setting with minimal communication.

Outline. Our main two-party result is proved in Section 4 based on background in Section 2. Using a
new sampling protocol for $k$ players (Section 3) we extend the two-party result to $k$ players in Section 5 and
present an empirical study in Section 6. In Section 7 we present our results for distributed optimization.



Related Work. Existing work in distributed learning mainly focuses on either inferring an accurate
global classifier from multiple distributed sub-classifiers learned individually (at respective nodes) or
on improving the efficiency of the overall learning protocol. The first line of work consists of techniques
like parameter mixing (McDonald et al., 2010; Mann et al., 2009) or averaging (Collins, 2002)
and classifier voting (Bauer & Kohavi, 1999). These approaches do admit convergence results but lack
any bounds on the communication. Voting, on the other hand, has been shown (Daumé III et al., 2012)
to yield suboptimal results on adversarially partitioned datasets. The goal of the second line of work
is to make distributed algorithms scale to very large datasets; many of these works (Chu et al., 2007;
Teo et al., 2010) depend on MapReduce to extract performance improvement. Dekel et al. (2010)
averaged over mini-batches of accumulated gradients to improve regret bounds for distributed online settings.
Zinkevich et al. (2010) proposed a MapReduce-based improved parallel stochastic gradient descent, and
more recently Servedio & Long (2011) improved the time complexity of $\gamma$-margin parallel algorithms from
$\Omega(1/\gamma^2)$ to $O(1/\gamma)$. Finally, (Duchi et al., 2010) and (Agarwal & Duchi, 2011) consider optimization in
distributed settings but their convergence analysis applies to specific cases of subgradient and stochastic
gradient descent algorithms.

Surprisingly, communication in learning has not been studied as a resource to be used sparingly. And as
(Daumé III et al., 2012) and this work demonstrate, intelligent interaction between nodes, communicating
relevant aspects of the data, not just its classification, can greatly reduce the necessary communication
over existing approaches. On large distributed systems, communication has become a major bottleneck for
many real-world problems; it accounts for a large percentage of total energy costs, and is the main reason
that MapReduce algorithms are designed to minimize rounds (of communication). This strongly motivates
the need to incorporate the study of this aspect of an algorithm directly, as presented and modeled in this
paper.

Recently but independently, research by (Balcan et al., 2012) considers very similar models to those of
(Daumé III et al., 2012). They also consider adversarially distributed data among $k$ parties and attempt
to learn on the adversarially distributed data while minimizing the total communication between the
parties. Like (Daumé III et al., 2012), the work of (Balcan et al., 2012) presents both agnostic and
non-agnostic results for generic settings, and shows improvements over sampling bounds in several specific
settings including the $d$-dimensional linear classifier problem we consider here (also drawing inspiration from
boosting). In addition, their work provides total communication bounds for decision lists and for proper
and non-proper learning of parity functions. They also extend the model so as to preserve differential and
distributional privacy while conserving total communication, as a resource, during the learning process.

In contrast, this work identifies optimization as a key primitive underlying many learning tasks, and
focuses on solving the underlying optimization problems as a way to provide general communication-friendly
distributed learning methods. We introduce techniques that rely on multiplicative weight updates
and multi-pass streaming algorithms. Our main contributions are translating these techniques into this
distributed setting and using them to solve LPs (and SDPs) in addition to solving for $d$-dimensional linear
separators.



2 Background 

In this section, we revisit the model proposed in (Daumé III et al., 2012) and mention related results.



Model. We assume that there are $k$ parties $P_1, P_2, \ldots, P_k$. Each party $P_i$ possesses a dataset $D_i$ that no
other party has access to, and each $D_i$ may have both positive and negative examples. The goal is to
classify the full dataset $D = \bigcup_i D_i$ correctly. We assume that there exists a perfect classifier $h^*$ from a
family of classifiers $\mathcal{H}$ with associated range space $(D, \mathcal{H})$ and bounded VC-dimension $\nu$. We are willing
to allow $\varepsilon$-classification error on $D$ so that up to $\varepsilon|D|$ points in total are misclassified.

Each word of data (e.g., a single point or vector in $\mathbb{R}^d$ counts as $O(d)$ words) passed between any pair
of parties is counted towards the total communication; this measure in words allows us to examine the
cost of extending to $d$ dimensions, and allows us to consider communication in forms other than example
points, but does not hinder us with precision issues required when counting bits. For instance, a protocol
that broadcasts a message of $M$ words (say $M/d$ points in $\mathbb{R}^d$) from one node to the other $k-1$ players
costs $O(kM)$ communication. The goal is to design a protocol with as little communication as possible.
We assume an adversarial model of data distribution; in this setting we prepare for the worst, and allow
some adversary to determine which player gets which subset of $D$.

Sampling bounds. Given any dataset $D$ and a family of classifiers with bounded VC-dimension $\nu$, a
classifier that is consistent with a random sample of size

$s_{\varepsilon,\nu} = O(\min\{(\nu/\varepsilon)\log(\nu/\varepsilon),\ \nu/\varepsilon^2\})$    (2.1)

from $D$ has at most $\varepsilon$-classification error on $D$ with constant probability (Anthony & Bartlett, 2009), as
long as there exists a perfect classifier. Throughout this paper we will assume that a perfect classifier
exists. This constant probability of success can be amplified to any $1-\delta$ with an extra $O(\log(1/\delta))$ factor
of samples.
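
As a rough illustration of how the bound (2.1) behaves, the following minimal sketch evaluates it numerically. The function name `sample_size`, the hidden constant `const` (set to 1), and the use of the natural logarithm are all assumptions made for illustration; the bound itself is only asymptotic.

```python
import math

def sample_size(eps, nu, const=1.0):
    """Evaluate min{(nu/eps) log(nu/eps), nu/eps^2} from (2.1).

    `const` stands in for the constant hidden by the O-notation and the
    natural log is used; both are assumptions, not values from the paper.
    """
    if not 0 < eps < 1:
        raise ValueError("eps must lie in (0, 1)")
    if nu < 1:
        raise ValueError("VC-dimension must be at least 1")
    log_term = (nu / eps) * math.log(nu / eps)
    square_term = nu / eps ** 2
    return math.ceil(const * min(log_term, square_term))

# Example: VC-dimension 50 and 5% error.
n_points = sample_size(0.05, 50)
```

For small $\varepsilon$ the logarithmic term is the smaller of the two, which is why later sections quote the sample size as $(d/\varepsilon)\log(d/\varepsilon)$.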

Randomly partitioned distributions. Assume that for all $i \in [1, k]$, each party $P_i$ has a dataset $D_i$ drawn
from the same distribution. That is, all datasets $D_i$ are identically distributed. This case is much simpler
than what the remainder of this paper will consider. Using (2.1), each $D_i$ can be viewed as a sample from
the full set $D = \bigcup_i D_i$, and with no communication each party $P_i$ can faithfully estimate a classifier with
error $O((\nu/|D_i|)\log(\nu|D_i|))$.

Henceforth we will focus on adversarially distributed data. 

One-way protocols. Consider a restricted setting where protocols are only able to send data from parties
$P_i$ (for $i \geq 2$) to $P_1$; a restricted form of one-way communication. We can again use (2.1) so that all
parties $P_i$ send a sample $S_i$ of size $s_{\varepsilon,\nu}$ to $P_1$, and then $P_1$ constructs a global classifier on $\bigcup_{i=2}^{k} S_i$ with
$\varepsilon$-classification error on $D = \bigcup_i D_i$; this requires $O(dk s_{\varepsilon,\nu})$ words of communication for points in $\mathbb{R}^d$.



For specific classifiers Daumé III et al. (2012) do better. For thresholds and intervals one can learn
a zero-error distributed classifier using a constant amount of one-way communication. The same can be
achieved for axis-aligned rectangles with $O(kd^2)$ words of communication. However, those authors show
that hyperplanes in $\mathbb{R}^d$, for $d \geq 2$, require at least $\Omega(k/\varepsilon)$ one-way bits of communication to learn an
$\varepsilon$-error distributed classifier.

Two-way protocols. Hereafter, we consider two-way protocols where any two players can communicate
back and forth. It has been shown (Daumé III et al., 2012) that, in $\mathbb{R}^2$, a protocol can learn linear
classifiers with at most $\varepsilon$-classification error using at most $O(k^2 \log 1/\varepsilon)$ communication. This protocol is
deterministic and relies on a complicated pruning argument, whereby in each round, either an acceptable
classifier is found, or a constant fraction more of some party's data is ensured to be classified correctly.

3 Improved Random Sampling for k-players

Our first contribution is an improved $k$-player sampling-based protocol that uses two-way
communication and the sampling result in (2.1). We designate party $P_1$ as a coordinator, and it gathers the size of
each player's dataset $D_i$, simulates sampling from each player completely at random, and then reports back
to each player the number of samples to be drawn by it, in $O(k)$ communication. Then each other party
$P_i$ selects $s_{\varepsilon,\nu}|D_i|/|D|$ random points (in expectation), and sends them to the coordinator. The union of
these sets satisfies the conditions of the result from (2.1) over $D = \bigcup_i D_i$ and yields the following result.

Theorem 3.1. Consider any family of hypotheses that has VC-dimension $\nu$ for points in $\mathbb{R}^d$. Then there
exists a two-way $k$-player protocol using $O(kd + d\min\{(\nu/\varepsilon)\log(\nu/\varepsilon), \nu/\varepsilon^2\})$ total words of communication
that achieves $\varepsilon$-classification error, with constant probability.
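
The coordinator step of this protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `allocate_samples` and `party_sample` are hypothetical, and sampling with replacement is assumed for simplicity. The coordinator only sees the dataset sizes $|D_i|$, simulates $s$ uniform draws from $D$, and tells each party how many points to send.

```python
import random

def allocate_samples(sizes, s, rng):
    """Coordinator: simulate s uniform draws from D given only |D_i|.

    Returns how many sample points each party should contribute; party i
    gets s * |D_i| / |D| points in expectation. Drawing with replacement
    is an assumption of this sketch.
    """
    parties = list(range(len(sizes)))
    counts = [0] * len(sizes)
    for p in rng.choices(parties, weights=sizes, k=s):
        counts[p] += 1
    return counts

def party_sample(data, m, rng):
    """Party: pick m of its own points uniformly at random (with replacement)."""
    return [rng.choice(data) for _ in range(m)]

# Example: three parties, one of them holding no data.
rng = random.Random(0)
counts = allocate_samples([100, 0, 300], 40, rng)
```

Only the $k$ sizes and $k$ counts travel in the first exchange; the sampled points themselves account for the remaining communication.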

Again using two-way communication, this type of result can be made even more general. Consider
the case where each $P_i$'s dataset arrives in a continuous stream; this is what is known as a distributed
data stream (Cormode et al., 2008). Then applying results of (Cormode et al., 2010), we can continually
maintain a sufficient random sample at the coordinator of size $s_{\varepsilon,\nu}$, communicating $O((k + s_{\varepsilon,\nu}) d \log |D|)$
words.

Theorem 3.2. Consider any family of hypotheses that has VC-dimension $\nu$ for points in $\mathbb{R}^d$. Let each
of $k$ parties have a stream of data points $D_i$ where $D = \bigcup_i D_i$. Then there exists a two-way $k$-player
protocol using $O((k + \min\{(\nu/\varepsilon)\log(\nu/\varepsilon), \nu/\varepsilon^2\})\, d \log|D|)$ total words of communication that maintains
$\varepsilon$-classification error, with constant probability.



4 A Two-Party Protocol

In this section, we consider only two parties, and for notational clarity, we refer to them as A and B.
A's dataset is labeled $D_A$ and B's dataset is labeled $D_B$. Let $|D_B| = n$. Our protocol, summarized
in Algorithm 1, is called WeightedSampling. In each round, A sends a classifier $h_A^t$ to B and B
responds with a set of points $R_B$, which it constructs by sampling from a weighting on its points.
At the end of $T$ rounds (for $T = O(\log(1/\varepsilon))$), we will show that voting on the results of the
$T$ classifiers misclassifies at most $\varepsilon|D_B|$ points from $D_B$ while being perfect on $D_A$, and since
$\varepsilon|D_B| \le \varepsilon|D_B \cup D_A| = \varepsilon|D|$, this yields an $\varepsilon$-optimal classifier as desired.

There are two ways B can construct the set $R_B$: a random sample and a deterministic sample. For
simplicity, we will focus our presentation on the randomized version since it is more practical, although it
has slightly worse bounds in the two-party case. Then we will also mention and analyze the deterministic
version.

It remains to describe how B's points are weighted and updated, which dictates how B constructs the
sample sent to A. Initially, they are all given a weight $w_i^1 = 1$. Then the re-weighting strategy (described
in Algorithm 2) is an instance of the multiplicative weight update framework; with each new proposed
classifier $h_A^t$ from A, party B increases all weights of misclassified points by a $(1+\rho)$ factor, and does
not change the weight for correctly classified points. We will show $\rho = 0.75$ is sufficient. Intuitively, this
ensures that consistently misclassified points eventually get weighted high enough that they are very likely
to be chosen as examples to be communicated in future rounds. The deterministic variant simply replaces
Line 7 of Algorithm 2 with the weighted variant (Matoušek, 1991) of the deterministic construction of $R_B$
(Chazelle, 2000); see details below.



Note that this is roughly similar in spirit to the heuristic protocol (Daumé III et al., 2012) that exchanged
support points and was called IterativeSupports, which we will experimentally compare against. But
the protocol proposed here is less rigid, and as we will demonstrate next, this allows for a much less nuanced
analysis.



Algorithm 1 WeightedSampling
Input: $D_A$, $D_B$, parameters: $0 < \varepsilon < 1$
Output: $h_{AB}$ (classifier with $\varepsilon$-error on $D_A \cup D_B$)
Init: $R_B = \{\}$; $w_i^1 = 1\ \forall x_i \in D_B$;
for $t = 1 \ldots T = 5\log_2(1/\varepsilon)$ do
    A's move
        $D_A = D_A \cup R_B$;
        $h_A^t$ := Learn($D_A$);
        send $h_A^t$ to B;
    B's move
        $R_B$ := Mwu($D_B$, $h_A^t$, $\rho = 0.75$, $c = 0.2$);
        send $R_B$ to A;
end for
$h_{AB}$ = Majority($h_A^1, h_A^2, \ldots, h_A^T$);



Algorithm 2 Mwu($D_B$, $h_A^t$, $\rho$, $c$)
1: Input: $h_A^t$, $D_B$, parameters: $0 < \rho < 1$, $0 < c < 1$
2: Output: $R_B$ (a set of $s_{c,d}$ points)
3: for all ($x_i \in D_B$) do
4:     if ($h_A^t(x_i) \neq y_i$) then $w_i^{t+1} = w_i^t(1+\rho)$;
5:     if ($h_A^t(x_i) = y_i$) then $w_i^{t+1} = w_i^t$;
6: end for
7: randomly sample $R_B$ from $D_B$ (according to $w^{t+1}$);
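
The two algorithms above can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: a perceptron (`learn`) stands in for the unspecified Learn routine (the experiments use an SVM), the per-round sample size `size` is a free parameter rather than the theoretical $s_{c,d}$, sampling is with replacement, and all function names are hypothetical. Points are `(x, y)` pairs with `x` a tuple and `y` in `{+1, -1}`.

```python
import math
import random

def learn(points, max_epochs=1000):
    """Perceptron stand-in for Learn(): returns (w, b) that separates
    `points` when they are linearly separable with a margin."""
    d = len(points[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in points:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def predict(h, x):
    w, b = h
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def mwu(D_B, weights, h, rho, size, rng):
    """Algorithm 2: multiply the weight of each misclassified point of B
    by (1 + rho), then draw `size` points of D_B according to the weights."""
    for i, (x, y) in enumerate(D_B):
        if predict(h, x) != y:
            weights[i] *= 1 + rho
    return rng.choices(D_B, weights=weights, k=size)

def weighted_sampling(D_A, D_B, eps, size, rho=0.75, seed=0):
    """Algorithm 1: returns the majority-vote classifier as a function."""
    rng = random.Random(seed)
    weights = [1.0] * len(D_B)
    known = list(D_A)                      # D_A plus every R_B received so far
    classifiers = []
    T = math.ceil(5 * math.log2(1 / eps))
    for _ in range(T):
        h = learn(known)                   # A's move
        classifiers.append(h)
        known += mwu(D_B, weights, h, rho, size, rng)   # B's move
    def majority(x):
        return 1 if sum(predict(h, x) for h in classifiers) > 0 else -1
    return majority
```

Only the classifiers and the sampled sets cross between A and B, so the communication of this sketch is `T * (size + 1) * (d + 1)` words regardless of how many points B holds.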



4.1 Analysis 

Our analysis is based on the multiplicative weight update framework (and closely resembles boosting). 
First, we state a key structural lemma. Thereafter, we use this lemma to prove our main result. 

As mentioned above (see (2.1)), after collecting a random sample $S_\varepsilon$ of size $s_{\varepsilon,d} = O(\min\{(d/\varepsilon)\log(d/\varepsilon), d/\varepsilon^2\})$
drawn over the entire dataset $D \subset \mathbb{R}^d$, a linear classifier learned on $S_\varepsilon$ is sufficient to provide $\varepsilon$-classification
error on all of $D$ with constant probability. There exist deterministic constructions for these samples,
still of size $s_{\varepsilon,\nu}$ (Chazelle, 2000); although they provide at most $\varepsilon$-classification error with probability 1,
they, in general, run in time exponential in $\nu$. Note that the VC-dimension of linear classifiers in $\mathbb{R}^d$ is
$O(d)$, and these results still hold when the points are weighted and the sample is drawn (respectively
constructed (Matoušek, 1991)) and error measured with respect to this weighting distribution. Thus B
could send $s_{\varepsilon,d}$ points to A, and we would be done; but this is too expensive. We restate this result with
a constant $c$, so that at most a $c$ fraction of the weights of points are misclassified (later we show that
$c = 0.2$ is sufficient with our framework). Specifically, setting $\varepsilon = c$ and rephrasing the above results yields
the following lemma.

Lemma 4.1. Let B have a weighted set of points $D_B$ with weight function $w : D_B \to \mathbb{R}^+$. For any
constant $c > 0$, party B can send a set $S_{c,d}$ of size $O(d)$ (where the constant depends on $c$) such that any
linear classifier that correctly classifies all points in $S_{c,d}$ will misclassify points in $D_B$ with a total weight at
most $c \sum_{x \in D_B} w(x)$. The set $S_{c,d}$ can be constructed deterministically, or a weighted random sample from
$(D_B, w)$ succeeds with constant probability.

We first state the bound using the deterministic construction of the set $S_{c,d}$, and then extend it to the
more practical (from a runtime perspective) random sampling result, but with a slightly worse
communication bound.

Theorem 4.1. The deterministic version of two-party two-way protocol WeightedSampling for linear
separators in $\mathbb{R}^d$ misclassifies at most $\varepsilon|D|$ points after $T = O(\log(1/\varepsilon))$ rounds using $O(d^2 \log(1/\varepsilon))$ words
of communication.

Proof. At the start of each round $t$, let $\Phi^t$ be the potential function given by the sum of weights of all
points in that round. Initially, $\Phi^1 = \sum_{x_i \in D_B} w_i^1 = n$ since by definition for each point $x_i \in D_B$ we have
$w_i^1 = 1$.

Then in each round, A constructs a classifier $h_A^t$ that correctly classifies a set of points of $D_B$ accounting
for at least a $1-c$ fraction of the total weight, by Lemma 4.1. All other (misclassified) points are upweighted
by $(1+\rho)$. Hence, for round $(t+1)$ we have $\Phi^{t+1} \le \Phi^t((1-c) + c(1+\rho)) = \Phi^t(1 + c\rho) \le n(1 + c\rho)^t$.

Let us consider the weight of the points in the set $S \subset D_B$ that have been misclassified by a majority
of the $T$ classifiers (after the protocol ends). This implies every point in $S$ has been misclassified at least
$T/2$ times and at most $T$ times. So the minimum weight of points in $S$ is $(1+\rho)^{T/2}$
and the maximum weight is $(1+\rho)^T$.

Let $n_i$ be the number of points in $S$ that have weight $(1+\rho)^i$ where $i \in [T/2, T]$. The potential function
value of $S$ after $T$ rounds is $\Phi_S = \sum_{i=T/2}^{T} n_i (1+\rho)^i$. Our claim is that $\sum_{i=T/2}^{T} n_i = |S| \le \varepsilon n$. Each of these
$|S|$ points has a weight of at least $(1+\rho)^{T/2}$. Hence we have that

$$\Phi_S = \sum_{i=T/2}^{T} n_i (1+\rho)^i \ \ge\ (1+\rho)^{T/2} \sum_{i=T/2}^{T} n_i = (1+\rho)^{T/2}|S|.$$

Relating these two inequalities (with $\Phi^{T+1}$ the total weight after $T$ rounds) we obtain the following,

$$|S|(1+\rho)^{T/2} \le \Phi_S \le \Phi^{T+1} \le n(1 + c\rho)^T.$$

Hence (using $T = 5\log_2(1/\varepsilon)$)

$$|S| \le n\left(\frac{1 + c\rho}{(1+\rho)^{1/2}}\right)^{T} = n\,(1/\varepsilon)^{5\log_2\left((1+c\rho)/(1+\rho)^{1/2}\right)}.$$

Setting $c = 0.2$ and $\rho = 0.75$ we get $5\log_2((1 + c\rho)/(1+\rho)^{1/2}) < -1$ and thus $|S| < n(1/\varepsilon)^{-1} = \varepsilon n$, as
desired since $\varepsilon < 1$. Each round uses $O(d)$ points, each requiring $d$ words of communication, yielding
a total communication of $O(d^2 \log(1/\varepsilon))$. □
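
The choice of constants in the proof can be checked numerically. The short sketch below (function names are hypothetical, introduced only for this check) evaluates the exponent $5\log_2((1+c\rho)/(1+\rho)^{1/2})$ and the resulting bound $n(1/\varepsilon)^{\text{exponent}}$ on $|S|$.

```python
import math

def exponent(c, rho):
    """Exponent of (1/eps) in the bound on |S| when T = 5 log2(1/eps)."""
    return 5 * math.log2((1 + c * rho) / math.sqrt(1 + rho))

def misclassified_bound(n, eps, c=0.2, rho=0.75):
    """Upper bound n * (1/eps)^exponent on the number of points of D_B
    misclassified by the majority vote."""
    return n * (1 / eps) ** exponent(c, rho)

# With c = 0.2 and rho = 0.75 the exponent is just below -1, so the bound
# falls below eps * n; a larger c (for example 0.3) does not achieve this
# with the same number of rounds.
ok = exponent(0.2, 0.75) < -1
too_weak = exponent(0.3, 0.75) > -1
```

The margin is thin (the exponent is roughly $-1.01$), which suggests the factor 5 in $T$ and the pair $(c, \rho)$ were chosen together.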

In order to use random sampling (as suggested in Algorithm 2), we need to address the probability of
failure of our protocol. More specifically, the set $S_{c,d}$ in Lemma 4.1 is of size $O(d \log(1/\delta'))$ and
a linear classifier that has no error on $S_{c,d}$ misclassifies points in $D_B$ with weight at most $c \sum_{x \in D_B} w(x)$,
with probability at least $1 - \delta'$.

However, we would like this probability of failure to be a constant $\delta$ over the entire course of the protocol.
To guarantee this, we need the $c$-misclassification property to hold in each of $T$ rounds. Setting $\delta' = \delta/T$,
and applying the union bound implies that the probability of failure at any point in the protocol is at
most $\sum_{i=1}^{T} \delta' = \sum_{i=1}^{T} \delta/T = \delta$. This increases the communication cost of each round to $O(d^2 \log(1/\delta')) =
O(d^2 \log(\log(1/\varepsilon)/\delta)) = O(d^2 \log\log(1/\varepsilon))$ words, with a constant $\delta$ probability of failure. Hence using
random sampling as described in WeightedSampling requires a total of $O(d^2 \log(1/\varepsilon) \log\log(1/\varepsilon))$ words
of communication. We formalize below.

Theorem 4.2. The randomized two-party two-way protocol WeightedSampling for linear separators
in $\mathbb{R}^d$ misclassifies at most $\varepsilon|D|$ points, with constant probability, after $T = O(\log(1/\varepsilon))$ rounds using
$O(d^2 \log(1/\varepsilon) \log\log(1/\varepsilon))$ words of communication.

5 k-Party Protocol

In Section 3 we described a simple protocol (Theorem 3.1) to learn a classifier with $\varepsilon$-error jointly among $k$
parties using $O(kd + d\min\{(\nu/\varepsilon)\log(\nu/\varepsilon), \nu/\varepsilon^2\})$ words of total communication. We now combine this with
the two-party protocol from Section 4 to obtain a $k$-player protocol for learning a joint classifier with error
$\varepsilon$.

We fix an arbitrary node (say $P_1$) as the coordinator for the $k$-player protocol of Theorem 3.1. Then $P_1$
runs a version of the two-player protocol (from Section 4) from A's perspective, where players $P_2, \ldots, P_k$
serve jointly as the second player B. To do so, we follow the distributed sampling approach outlined in
Theorem 3.1. Specifically, we fix a parameter $c$ (set $c = 0.2$). Each other node reports the total weight
$w(D_i)$ of its data to $P_1$, who then reports back to each node what fraction $w(D_i)/w(D)$ of the total weight
it owns. Then each player sends the coordinator a random sample of size $s_{c,d}\, w(D_i)/w(D)$. Recall that we
require $s_{c,d} = O(d \log\log(1/\varepsilon))$ in this case to account for probability of failure over all rounds. The union
of these sets at $P_1$ satisfies the sampling condition in Lemma 4.1 for $\bigcup_{i=2}^{k} D_i$. $P_1$ computes a classifier on
the union of its data and this joint sample and all previous joint samples, and sends the resulting classifier
back to all the nodes. Sending this classifier to each party requires $O(kd)$ words of communication. The
process repeats for $T = O(\log(1/\varepsilon))$ rounds.

Theorem 5.1. The randomized $k$-party protocol for $\varepsilon$-error linear separators in $\mathbb{R}^d$ terminates in $T =
O(\log(1/\varepsilon))$ rounds using $O((kd + d^2 \log\log(1/\varepsilon)) \log(1/\varepsilon))$ words of communication, and has a constant
probability of failure.

Proof. The correctness and bound of $T = O(\log(1/\varepsilon))$ rounds follows from Theorem 4.1 since, aside from
the total weight gathering step, from party $P_1$'s perspective it appears to run the protocol with some party B
where B represents parties $P_2, P_3, \ldots, P_k$. The communication for $P_1$ to collect the samples from all parties
is $O(kd + d s_{c,d}) = O(kd + d^2 \log\log(1/\varepsilon))$. And it takes $O(dk)$ communication to return $h_A$ to all $k-1$ other
players. Hence the total communication over $T = O(\log(1/\varepsilon))$ rounds is $O((kd + d^2 \log\log(1/\varepsilon)) \log(1/\varepsilon))$
as claimed. □

Whereas this randomized sampling algorithm required a sample of size $s_{c,d} = O(d \log\log(1/\varepsilon))$, we can
achieve a different communication trade-off using the deterministic construction. We can no longer use the
result from Theorem 3.1 since that has a probability of failure. In this case, in each round each party $P_i$
communicates a deterministically constructed set of size $s_{c,d} = O(d)$, then the coordinator $P_1$ computes
a classifier that correctly classifies points from all of these sets, and hence has at most $c\,w(D_i)$ weight of
points misclassified in each $D_i$. The error is at most $c\,w(D_i)$ on each dataset $D_i$, so the error on all sets is
at most $c \sum_{i=1}^{k} w(D_i) = c\,w(D)$. Again using $T = O(\log(1/\varepsilon))$ rounds we can achieve the following result.

Theorem 5.2. The deterministic $k$-party protocol for $\varepsilon$-error linear separators in $\mathbb{R}^d$ terminates in $T =
O(\log(1/\varepsilon))$ rounds using $O(kd^2 \log(1/\varepsilon))$ words of communication.

6 Experiments 

In this section, we present empirical results, using WeightedSampling, for finding linear classifiers in
$\mathbb{R}^d$ for two-party and $k$-party scenarios. We empirically compare the following approaches.

• Naive: a naive approach that sends all data from $(k-1)$ nodes to a coordinator node and then
learns at the coordinator. For any dataset, this accuracy is the best possible.

• Voting: a simple voting strategy that trains classifiers at each individual node and sends over the
$(k-1)$ classifiers to a coordinator node. For any datapoint, the coordinator node predicts the label
by taking a vote over all $k$ classifiers.

• Rand: each of the $(k-1)$ nodes sends a random sample of size $s_{\varepsilon,d}$ to a coordinator node and then
a classifier is learned at the coordinator node using all of its own data and the samples received.

• RandEmp: a cheaper version of Rand that uses a random sample of size $9d$ from each party each
round; this value was chosen to make this baseline technique as favorable as possible.

• MaxMarg: IterativeSupports that selects informative points heuristically (Daumé III et al.,
2012). A node is chosen as the coordinator and the coordinator exchanges maximum margin support
points with each of the $(k-1)$ nodes. This continues until the training accuracy reaches within
$(1-\varepsilon)$ of the optimal (i.e., $(1-\varepsilon)100\%$ in our case since we assume linearly separable data) or the
communication cost equals the total size of the data at $(k-1)$ non-coordinator nodes (i.e., the cost
for Naive).

• Mwu: WeightedSampling that randomly samples points based on the distribution of the weights
and runs for $5\log(1/\varepsilon)$ rounds (ref. Section 4).

• MwuEmp: a cheaper version of Mwu with an early stopping condition. The protocol is stopped
early if the training accuracy has reached within $(1-\varepsilon)$ of the optimal, i.e., $(1-\varepsilon)100\%$.



We do not compare results with Median (Daumé III et al., 2012) as it does not work on datasets beyond
two dimensions. For all these methods, SVM (from the libSVM (Chang & Lin, 2011) library), with a linear
kernel, was used as the underlying classifier. We report training accuracy and communication cost. The
training accuracy is computed over the combined dataset $D$ with an $\varepsilon$ value of 0.05 (where applicable).
The communication cost (in words) of each method is reported as a ratio with reference to MwuEmp as
the base method. All numbers reported are averaged over 10 runs of the experiments; standard deviations
are reported where appropriate. For Mwu and MwuEmp, we use $\rho = 0.75$.



Communication Cost Computation. In the following, we describe the communication cost computation
for each method. Each example point sent from one node to another incurs a communication cost of $d + 1$,
since it requires $d$ words to describe its position in $\mathbb{R}^d$ and 1 word to describe its sign. Similarly, each linear
classifier requires $d + 1$ words of communication to send; $d$ words to describe its direction, and 1 word to
describe its offset.

• Naive: assuming node 1 to be the coordinator, the total cost is the number of words sent over by each
node to the coordinator and is equal to $\sum_{i=2}^{k} (d+1)|D_i|$.

• Voting: each node sends over its classifier to the coordinator node, which incurs a total cost of
$(d+1)(k-1)$.

• Rand: the cost is equal to $(k-1)(d+1)s_{\varepsilon,d} = (k-1)(d+1)(d/\varepsilon)\log(d/\varepsilon)$ times some constant, where
we set the constant to 1.

• RandEmp: despite the theoretical cost of $(k-1)(d+1)s_{\varepsilon,d} = (k-1)(d+1)(d/\varepsilon)\log(d/\varepsilon)$ (same
as Rand), in practice the random sampling based approach performs well with far fewer samples.
Starting with a sample size of 5, we first perform a doubling search to find the range within which
RandEmp achieves $\varepsilon$-optimal accuracy and then do binary search within this range to pick the
smallest value for the sample size. Our goal is to pick one value that performs well across all of
our datasets. In our case, $9d$ seems to work well for all the datasets we tested. Thus, in our case,
RandEmp incurs a total cost of $9d(d+1)(k-1)$ words.

• MaxMarg: let $SP_i$ denote the support set of node $i$. Assuming node 1 to be the coordinator, the total
cost in each round is equal to $(d+1)(k-1)|SP_1| + \sum_{i=2}^{k} (d+1)|SP_i|$ (the number of words sent by
the coordinator to all $(k-1)$ nodes plus the number of words sent back by the $(k-1)$ nodes to the
coordinator). The cost accumulates over rounds until the target accuracy is reached or until the cost
equals the total size of the data at $(k-1)$ non-coordinator nodes (i.e., the cost for Naive).

• Mwu: for our algorithm the cost incurred in each round is $(d+1)s_{c,d}(k-1) + (d+1)(k-1)$ words. The
first term comes from each player other than the coordinator sending $s_{c,d}$ points to the coordinator.
The second term accounts for the coordinator replying with a classifier to each of those $(k-1)$ other
players. However, we observe that exchanging a small constant number of samples, instead of $s_{c,d}$,
each round works quite well in practice for all of our datasets. For our analysis we had set $c = 1/5$,
indicating that $s_{c,d}$ is some constant times $25d$. But in our experiments, we use a much smaller
sample size of 100 per round, with a word cost of $100(d+1)$ per round. The search process to find
this smaller sample size is the same as described in RandEmp. The number of rounds for Mwu is
$5\log_2(1/0.05)\log_2\log_2(1/0.05) \approx 5 \times 10 = 50$.

• MwuEmp: similar to Mwu, the sample size chosen is 100 and the cost is $100(d+1)(k-1) + (d+1)(k-1)$
words times the number of rounds until the early stopping criterion is met.

Note that given our cost computation, for some datasets the cost of Rand, RandEmp and Mwu can
exceed the cost of Naive (see, for example, Cancer). For those datasets, the size of the data is small
compared to the dimensions. As a result, the communication costs (in number of points) for (a) Rand:
$(k-1)s_{\varepsilon,d} = (k-1)(d/\varepsilon)\log(d/\varepsilon)$, (b) RandEmp: $9d(k-1)$, and (c) Mwu: $(100(k-1) + (k-1))T =
101(k-1) \times 50$ are large compared to the total size of the data at the $(k-1)$ non-coordinator nodes (i.e.,
the cost of Naive).
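
The cost formulas above can be collected into a small calculator, sketched below. The function names are hypothetical, the constant in Rand is set to 1 as in the text, the natural logarithm is an assumption, and MaxMarg is omitted because its cost depends on the support sets observed at run time.

```python
import math

def cost_naive(sizes, d):
    """Words to ship every non-coordinator dataset; sizes[0] is the coordinator."""
    return (d + 1) * sum(sizes[1:])

def cost_voting(k, d):
    """Each of the k - 1 non-coordinator nodes sends one classifier."""
    return (d + 1) * (k - 1)

def cost_rand(k, d, eps, const=1.0):
    """(k-1)(d+1) s_{eps,d} with s_{eps,d} = const * (d/eps) log(d/eps)."""
    return (k - 1) * (d + 1) * const * (d / eps) * math.log(d / eps)

def cost_rand_emp(k, d):
    """Empirical variant: 9d sample points from each non-coordinator node."""
    return 9 * d * (d + 1) * (k - 1)

def cost_mwu(k, d, rounds, sample=100):
    """Per round: `sample` points from each node plus one classifier back."""
    return rounds * ((d + 1) * sample * (k - 1) + (d + 1) * (k - 1))

# Two parties, d = 50, 5000 points each, 50 rounds of Mwu.
naive = cost_naive([5000, 5000], 50)
mwu_cost = cost_mwu(2, 50, rounds=50)
```

With these particular inputs the full 50-round Mwu cost is slightly above the Naive cost, which is the kind of situation the note above describes and which the early stopping rule of MwuEmp is meant to avoid.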



Datasets. We report results for two-party and four-party protocols on both synthetic and real-world 
datasets. 

Six datasets, three each for the two-party and four-party cases, have been generated synthetically from
mixtures of Gaussians. Each Gaussian has been carefully seeded to generate different data partitions. For
Synthetic1, Synthetic2, Synthetic4, Synthetic5, each node contains 5000 data points (2500 positive and
2500 negative) whereas for Synthetic3 and Synthetic6, each node contains 8500 data points (4250 positive
and 4250 negative), and all of these datapoints lie in 50 dimensions. Additionally, we investigate the
performance of our protocols on three real-world UCI datasets. Our goal is to
select datasets that are linearly separable or almost linearly separable. We choose Cancer and Mushroom
from the LibSVM data repository (Chang & Lin, 2011).

The proposed protocol works for perfectly separable datasets. However, this assumption is too idealistic,
and in practice real-world datasets are seldom perfectly separable, either because of the presence of noise
or due to the limitations of linear classifiers (for example, when the data has a non-linear decision
boundary). So most of our datasets have some amount of noise in them. This also shows that although our
protocols were designed for noiseless data, they work well on noisy datasets too. However, when applied to
noisy data, we do not guarantee the communication bounds that were claimed for noiseless datasets.

For the datasets that are not perfectly separable, the accuracy of Naive (with some tolerance), which
learns an SVM on the entire data, can be considered to be the best accuracy that can be achieved for that
particular dataset. Table 1 presents a summary of the datasets, the best possible accuracy that can be
achieved and also the accuracy required to yield an $\epsilon$-optimal classifier with $\epsilon = 0.05$.

Finally, in Tables 2 and 3 we highlight (in bold) the protocol that achieves the required accuracy and the
lowest communication cost and thus is the best among the methods compared. By best we mean that
the method has the cheapest communication cost as well as an accuracy that is at least $(1 - \epsilon)$
times the optimal, i.e., 95% in our case for $\epsilon = 0.05$. As will be frequently seen for Voting, the
communication cost is the cheapest but the accuracy is far from the desired $\epsilon$-error specified, and
in such circumstances we do not deem Voting the best method.
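The selection rule just described can be sketched in a few lines. The helper name `best_protocol` is hypothetical; the (accuracy, cost) pairs are copied from the Synthetic2 column of Table 2, and the 95.00 threshold is the required accuracy listed in Table 1.

```python
from typing import Dict, Optional, Tuple


def best_protocol(results: Dict[str, Tuple[float, float]],
                  required_acc: float) -> Optional[str]:
    """Return the cheapest protocol whose accuracy meets the required accuracy."""
    eligible = {name: cost for name, (acc, cost) in results.items()
                if acc >= required_acc}
    if not eligible:
        return None
    return min(eligible, key=eligible.get)


# (mean accuracy, communication cost) for Synthetic2, two-party case.
synthetic2 = {
    "Naive": (97.91, 6.18),
    "Voting": (60.64, 0.01),
    "Rand": (97.72, 3.71),
    "RandEmp": (95.13, 0.56),
    "MaxMarg": (93.76, 6.18),
    "Mwu": (97.59, 6.24),
    "MwuEmp": (95.17, 1.00),
}

print(best_protocol(synthetic2, required_acc=95.00))  # prints "RandEmp"
```

Voting has the lowest cost here but is filtered out by the accuracy requirement, which is exactly the situation the paragraph above warns about.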
| Dataset | Total # of points | # of points per player (2-player) | # of points per player (4-player) | Dimensions | Type | Perfectly separable? | Best accuracy | $\epsilon$-optimal accuracy |
|---|---|---|---|---|---|---|---|---|
| Synthetic1 | 10000 | 5000 | - | 50 | synthetic | no | 99.23 | 95.00 |
| Synthetic2 | 10000 | 5000 | - | 50 | synthetic | no | 97.91 | 95.00 |
| Synthetic3 | 17000 | 8500 | - | 50 | synthetic | no | 97.39 | 95.00 |
| Synthetic4 | 20000 | - | 5000 | 50 | synthetic | no | 99.26 | 95.00 |
| Synthetic5 | 20000 | - | 5000 | 50 | synthetic | no | 97.97 | 95.00 |
| Synthetic6 | 34000 | - | 8500 | 50 | synthetic | no | 97.47 | 95.00 |
| Cancer | 683 | 342 | 171 | 10 | real | no | 97.07 | 95.00 |
| Mushroom | 8124 | 4062 | 2031 | 112 | real | yes | 100 | 95.00 |

Table 1: Summary of datasets used ($\epsilon = 0.05$).





| Protocol | Synthetic1 Acc | Synthetic1 Cost | Synthetic2 Acc | Synthetic2 Cost | Synthetic3 Acc | Synthetic3 Cost |
|---|---|---|---|---|---|---|
| Naive | 99.23 (0.0) | 49.02 | 97.91 (0.0) | 6.18 | 97.39 (0.0) | 19.08 |
| Voting | 95.00 (0.0) | 0.01 | 60.64 (0.0) | 0.01 | 74.55 (0.0) | 0.01 |
| Rand | 99.02 (0.0) | 29.41 | 97.72 (0.0) | 3.71 | 97.16 (0.0) | 6.74 |
| RandEmp | 96.64 (0.1) | 4.41 | 95.13 (0.1) | 0.56 | 96.03 (0.1) | 1.01 |
| MaxMarg | 96.39 (0.0) | 4.26 | 93.76 (0.0) | 6.18 | 73.62 (0.0) | 19.08 |
| Mwu | 98.66 (0.1) | 49.51 | 97.59 (0.1) | 6.24 | 97.11 (0.1) | 11.34 |
| MwuEmp | 95.00 (0.0) | 1.00 | 95.17 (0.1) | 1.00 | 95.25 (0.2) | 1.00 |

Table 2: Mean accuracy (Acc) and communication cost (Cost) required by two-party protocols for synthetic
datasets.





| Protocol | Synthetic4 Acc | Synthetic4 Cost | Synthetic5 Acc | Synthetic5 Cost | Synthetic6 Acc | Synthetic6 Cost |
|---|---|---|---|---|---|---|
| Naive | 99.26 (0.0) | 100.00 | 97.97 (0.0) | 12.72 | 97.47 (0.0) | 54.84 |
| Voting | 95.00 (0.0) | 0.01 | 65.83 (0.0) | 0.01 | 75.52 (0.0) | 0.01 |
| Rand | 99.18 (0.0) | 60.00 | 97.83 (0.0) | 7.63 | 97.39 (0.0) | 19.35 |
| RandEmp | 97.33 (0.1) | 9.00 | 96.61 (0.1) | 1.15 | 96.67 (0.1) | 2.90 |
| MaxMarg | 95.95 (0.0) | 0.82 | 93.94 (0.0) | 15.15 | 75.05 (0.0) | 80.19 |
| Mwu | 98.03 (0.2) | 34.78 | 97.30 (0.1) | 4.45 | 96.87 (0.1) | 11.24 |
| MwuEmp | 95.11 (0.3) | 1.00 | 95.11 (0.2) | 1.00 | 95.45 (0.2) | 1.00 |

Table 3: Mean accuracy (Acc) and communication cost (Cost) required by four-party protocols for synthetic
datasets.

6.1 Synthetic Results 

Table 2 compares the performance metrics of the aforementioned protocols for two parties. As can be seen,
Voting performs the best for Synthetic1 and RandEmp performs the best for Synthetic2. For Synthetic3,
MwuEmp requires the least amount of communication to learn an $\epsilon$-optimal distributed classifier.
Note that, for Synthetic2 and Synthetic3, both Voting and MaxMarg fail to produce an $\epsilon$-optimal
($\epsilon = 0.05$) classifier. MaxMarg exhibits this behavior despite incurring a communication cost that
is as high as Naive. Note that the cost of MaxMarg being the same as Naive does not imply that MaxMarg
sends over all points. Rather, the accumulated cost of the support points becomes the same as the cost of
Naive, at which point we stop the algorithm. Usually, by this point, the accuracy of MaxMarg saturates and
does not improve with the exchange of more support points.

As shown in Table 3, most of the two-party results carry over to the multiparty case. Voting is the
best for Synthetic4 whereas MwuEmp is the best for Synthetic5 and Synthetic6. As earlier, both Voting
and MaxMarg do not yield a 0.05-optimal distributed classifier for Synthetic5 and Synthetic6.

Figure 1 (for two-party using Synthetic1) shows the communication costs (in log-scale) with variations
in the number of data points per node and the dimension of the data. Note that we do not report the
numbers for MaxMarg since MaxMarg takes a long time to finish. However, for Synthetic1 the numbers
for MaxMarg are similar to those of RandEmp and so their curves in the figure are also the same. Note
that in Figure 1(b), the cost of Naive increases as the number of dimensions increases. This is because the
cost is multiplied by a factor of $(d + 1)$ when expressed in words.
Figure 1: Communication cost vs Size and Dimensionality for Synthetic1 with 2-party protocol.



6.2 Real-World Data 

Table 4 presents results for two-party and four-party protocols using real-world datasets. Other
than the two-party case for Mushroom, Voting performs the best in all cases. However, note that
for Mushroom using the two-party protocol, Voting does not yield a 0.05-optimal distributed classifier.

The results for communication cost (in log-scale) versus data size and communication cost (in log-scale)
versus dimensionality are provided in Figure 2 for the two-party protocol using the Mushroom dataset.
MwuEmp (denoted by the black line) is comparable to MaxMarg and cheaper than all other baselines
(except Voting).



Remarks. The goal of our experiments is to show that our protocols perform well, particularly for difficult
or adversarially partitioned datasets. For easy datasets, any baseline technique can perform well. Indeed,
Voting performs the best on Synthetic1 and Synthetic4, and RandEmp performs better than others
on Synthetic2. For the remaining three cases on synthetic datasets, MwuEmp outperforms the other
baselines. On real-world data, Voting usually performs well. However, as we have shown earlier, for some
datasets Voting and MaxMarg fail to yield an $\epsilon$-optimal classifier. In particular for Mushroom,
using the two-party protocol, the accuracy achieved by Voting is far from $\epsilon$-optimal. This and
earlier results show that there exist scenarios where Voting and MaxMarg perform particularly poorly, and
so learning by majority voting or by exchanging support points between nodes is not a good strategy in
distributed settings, even more so when the data is partitioned adversarially.
| Setting | Protocol | Cancer Acc | Cancer Cost | Mushroom Acc | Mushroom Cost |
|---|---|---|---|---|---|
| 2-party | Naive | 97.07 (0.0) | 3.34 | 100.00 (0.0) | 20.01 |
| 2-party | Voting | 97.36 (0.0) | 0.01 | 88.38 (0.0) | 0.00 |
| 2-party | Rand | 97.16 (0.1) | 4.52 | 100.00 (1.1) | 36.97 |
| 2-party | RandEmp | 96.90 (0.2) | 0.88 | 100.00 (0.0) | 4.97 |
| 2-party | MaxMarg | 96.78 (0.0) | 0.22 | 100.00 (0.0) | 1.11 |
| 2-party | Mwu | 97.36 (0.2) | 49.51 | 100.00 (0.0) | 24.88 |
| 2-party | MwuEmp | 96.87 (0.4) | 1.00 | 99.73 (0.5) | 1.00 |
| 4-party | Naive | 97.07 (0.0) | 1.00 | 100.00 (0.0) | 28.61 |
| 4-party | Voting | 97.36 (0.0) | 0.03 | 95.67 (0.0) | 0.01 |
| 4-party | Rand | 97.19 (0.1) | 12.81 | 100.00 (0.6) | 105.70 |
| 4-party | RandEmp | 96.99 (0.1) | 2.50 | 99.99 (0.0) | 14.20 |
| 4-party | MaxMarg | 96.78 (0.0) | 0.56 | 100.00 (0.0) | 2.34 |
| 4-party | Mwu | 97.00 (0.2) | 48.46 | 100.00 (0.1) | 24.65 |
| 4-party | MwuEmp | 96.97 (0.3) | 1.00 | 98.86 (0.4) | 1.00 |

Table 4: Mean accuracy (Acc) and communication cost (Cost) required by all protocols for real-world
datasets.
Figure 2: Communication cost vs Size and Dimensionality for Mushroom with 2-party protocol.

7 Distributed Optimization



Many learning problems can be formulated as convex (or even linear or semidefinite) optimizations
(Bennett & Parrado-Hernández, 2006). In these problems, the data (points) act as constraints to the
resulting optimization; for example, in a standard SVM formulation, there is one constraint for each
point in the training set.

Since in our distributed setting, points are divided among the different players, a natural distributed
optimization problem can be stated as follows. Each player $i$ has a set of constraints
$C_i = \{f_{ij}(x) \ge 0\}$, and the goal is to solve the optimization $\min g(x)$ subject to the union of
constraints $\cup_i C_i$. As earlier, our goal is to solve the above with minimum communication.

A general solution for communication-efficient distributed convex optimization will allow us to reduce
communication overhead for a number of distributed learning problems. In this section, we illustrate two
algorithm design paradigms that achieve this for distributed convex optimization.

7.1 Optimization via Multi-Pass Streaming 

A streaming algorithm (Muthukrishnan) takes as input a sequence of items $x_1, \ldots, x_n$. The algorithm
is allowed working space that is sublinear in $n$, and is only allowed to look at each item once as it
streams past. A multipass streaming algorithm is one in which the algorithm may make more than one pass
over the data, but is still limited to sublinear working space and a single look at each item in each pass.

The following lemma shows how any (multipass) streaming algorithm can be used to build a multiparty 
distributed protocol. 

Lemma 7.1. Suppose that we can solve a given problem P using a streaming algorithm that has s words 
of working storage and makes r passes over the data. Then there is a k-player distributed algorithm for P 
that uses krs words of communication. 

Before proving the above lemma, we note that streaming algorithms often have $s = O(\mathrm{polylog}\, n)$
and $r = O(\log n)$, indicating that the total communication is $O(k\, \mathrm{polylog}\, n)$ words, which
is sublinear in the input size.

Proof. For ease of exposition, let us first consider the case when $k = 2$. Consider a streaming algorithm
$S$ satisfying the conditions above. The simulation works by letting the first player $A$ simulate the
first half of $S$, and letting the second player $B$ simulate the second half. Specifically, the first
player $A$ simulates the behavior of $S$ on its input. When this simulation of $S$ exhausts the input at
$A$, $A$ sends over the contents of the working store of $S$ to $B$. $B$ restarts $S$ on its input using
this working store as $S$'s current state. When $B$ has finished simulating $S$ on its input, it sends the
contents of the working storage back to $A$. This completes one pass of $S$, and uses at most $2s$ words of
communication ($s$ words in each direction). The process continues for $r$ passes.

If there are $k$ players $A_1, \ldots, A_k$ instead of two, then we fix an arbitrary ordering of the
players. The first player simulates $S$ on its input, and at completion passes the contents of the working
store to the next one, and so on. Each pass now requires $O(ks)$ words of communication, and the result
follows. □
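The simulation in the proof can be sketched as follows. This is a minimal illustration under stated assumptions: the driver `simulate_distributed`, the `update` callback signature, and the two-pass example (count and sum in the first pass, squared deviations from the mean in the second) are all hypothetical choices made here, not part of the original paper.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]  # working store of the streaming algorithm (s words)


def simulate_distributed(
    update: Callable[[State, float, int], State],
    init: State,
    shards: Sequence[Sequence[float]],
    passes: int,
) -> Tuple[State, int]:
    """Run a multipass streaming algorithm over data split among k players.

    Each player streams its own shard through `update`, then hands the
    working store to the next player; every hand-off costs len(state) words.
    """
    state = init
    words_sent = 0
    for p in range(passes):
        for shard in shards:            # players in a fixed, arbitrary order
            for item in shard:
                state = update(state, item, p)
            words_sent += len(state)    # pass the working store along
    return state, words_sent


def mean_then_sq_dev(state: State, x: float, p: int) -> State:
    """Pass 0: accumulate count and sum. Pass 1: accumulate squared deviations."""
    n, total, sq = state
    if p == 0:
        return (n + 1, total + x, sq)
    mean = total / n
    return (n, total, sq + (x - mean) ** 2)


shards = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]   # k = 3 players
final_state, words = simulate_distributed(mean_then_sq_dev, (0, 0.0, 0.0), shards, passes=2)
print(final_state, words)   # (6, 21.0, 17.5) 18
```

With k = 3 players, r = 2 passes and s = 3 words of state, the counter reports 18 = krs words, matching the bound in Lemma 7.1.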

We can apply this lemma to obtain a distributed algorithm for fixed-dimensional linear programming. This
relies on an existing streaming result (Chan & Chen, 2007).



Theorem 7.1 (Chan & Chen, 2007). Given $n$ halfspaces in $\mathbb{R}^d$ (for $d$ constant), we can compute
the lowest point in their intersection by an $O(1/\delta^{d-1})$-pass Las Vegas algorithm that uses
$O((1/\delta^{O(1)}) n^{\delta})$ space and runs in time $O((1/\delta^{O(1)}) n^{1+\delta})$ with high
probability, for any constant $\delta > 0$.



^Fixed-dimensional linear programming is the case of linear programming where the dimension is not part of the input. 
Effectively, this means that exponential dependence on the dimension is permitted; the dependence on the number of constraints 
remains polynomial as usual. 
Corollary 7.1. There is a k-player algorithm for solving distributed linear programming that uses
$O(k (1/\delta^{d+O(1)}) n^{\delta})$ communication, for any constant $\delta > 0$.
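As a worked step, the corollary follows by substituting the streaming bounds into Lemma 7.1. The derivation below assumes that the pass count and space bound of Theorem 7.1 read $r = O(1/\delta^{d-1})$ and $s = O((1/\delta^{O(1)}) n^{\delta})$ respectively:

```latex
\[
k \cdot r \cdot s
  \;=\; k \cdot O\!\left(\frac{1}{\delta^{d-1}}\right)
          \cdot O\!\left(\frac{1}{\delta^{O(1)}}\, n^{\delta}\right)
  \;=\; O\!\left(k\, \frac{1}{\delta^{d+O(1)}}\, n^{\delta}\right).
\]
```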

While the above streaming algorithm can be applied as a black box in Corollary 7.1, looking deeper into
the streaming algorithm reveals room for improvement. As in the case of classification, suppose that we
are permitted to violate an $\epsilon$-fraction of the constraints. It turns out that the above streaming
algorithm achieves its bounds by eliminating a fixed fraction of constraints in each pass, and thus
requires $\log_r n$ passes, where $r = n^{\Theta(\delta)}$. If we are allowed to violate an
$\epsilon$-fraction of constraints, we need only run the algorithm for $\log_r 1/\epsilon$ passes, where
$r$ is now $O(1/\epsilon^{\Theta(\delta)})$. This allows us to replace $n$ in all terms by $1/\epsilon$,
resulting in an algorithm with communication independent of $n$.

Corollary 7.2. There is a k-player algorithm for solving distributed linear programming that violates at
most an $\epsilon$-fraction of the constraints, and that uses
$O(k (1/\delta^{d+O(1)}) (1/\epsilon)^{\delta})$ communication, for any constant $\delta > 0$.



7.2 Optimization via Multiplicative Weight Updates 

The above result gives an approach for solving fixed-dimensional linear programming (exactly or with
at most $\epsilon n$ violated constraints) in a distributed setting. There is no known streaming algorithm
for arbitrary-dimensional linear programming, so the stream-algorithm-based design strategy cannot be used.
However, we will now show that the multiplicative weight update method can be applied in a distributed
manner, and this allows us to solve general linear programming problems, as well as SDPs and other convex
optimizations.

We first consider the problem of solving a general LP of the form $\min g^\top x$, subject to $Ax \ge b$,
$x \in P$, where $P$ is a set of "soft" constraints (for example, $x \ge 0$) and $Ax \ge b$ are the "hard"
constraints. Let $z^* = g^\top x^*$ be the optimal value of the LP, obtained at $x^*$. Then the
multiplicative weight update method can be used to obtain a solution $x$ such that $z^* = g^\top x$ and all
(hard) constraints are satisfied approximately, i.e., $\forall i,\ A_i x \ge b_i - \epsilon$, where
$A_i x \ge b_i$ is one row of the constraint matrix. We call such a solution a soft-$\epsilon$-approximation
(to distinguish it from a traditional approximation in which all constraints would be satisfied exactly and
the objective would be approximately achieved).



The standard protocol works as follows (Arora et al., 2005a). We assume that the optimal $z^*$ has
been guessed (this can be determined by binary search), and define the set of "soft" constraints to be
$\mathcal{P} = P \cup \{x \mid g^\top x = z^*\}$. Typically, it is easy to check for feasibility in
$\mathcal{P}$. We define a width parameter
$\rho = \max\{\max_{i \in [n],\, x \in \mathcal{P}} A_i x - b_i,\ 1\}$. Initialize $m_i(0) = 0$. Then we run
$T = O(\rho^2 \ln n/\epsilon^2)$ iterations (with $t = 1, 2, \ldots, T$) of the following:

1. Set $p_i(t) = \exp(-\epsilon\, m_i(t-1)/2)$.

2. Find feasible $x(t)$ in $\mathcal{P} \cup \{x \mid \sum_i p_i A_i x \ge \sum_i p_i b_i\}$.

3. $m_i(t) = m_i(t-1) + A_i x(t) - b_i$.

At the end, we return $\bar{x} = (1/T) \sum_{t=1}^{T} x(t)$ as a soft-$\epsilon$-approximation for the LP.
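The three steps above can be sketched in plain Python. This is an illustration under several assumptions that are not in the original text: the soft constraint set is a box `lo <= x_j <= hi` (so step 2 has a closed-form oracle), the objective constraint and the binary search for $z^*$ are omitted (feasibility only), the updates in step 3 are divided by the width `rho` (a common normalization), and the iteration count uses the explicit constant 4. The function name `mwu_lp_feasibility` is hypothetical.

```python
import math
from typing import List


def mwu_lp_feasibility(A: List[List[float]], b: List[float],
                       lo: float, hi: float, rho: float, eps: float) -> List[float]:
    """Return x_bar in the box [lo, hi]^d with A_i x_bar >= b_i - eps for every row i.

    Assumes |A_i x - b_i| <= rho for all x in the box and that the LP is feasible.
    """
    n, d = len(A), len(A[0])
    eta = eps / (2.0 * rho)                      # step size on the rescaled updates
    T = max(1, math.ceil(4.0 * rho ** 2 * math.log(n) / eps ** 2))
    m = [0.0] * n
    x_sum = [0.0] * d
    for _ in range(T):
        # Step 1: weights from the accumulated (rescaled) slacks.
        p = [math.exp(-eta * mi) for mi in m]
        # Step 2: oracle for the single aggregated constraint sum_i p_i A_i x >= sum_i p_i b_i.
        c = [sum(p[i] * A[i][j] for i in range(n)) for j in range(d)]
        rhs = sum(p[i] * b[i] for i in range(n))
        x = [hi if cj > 0 else lo for cj in c]   # maximizes c . x over the box
        if sum(cj * xj for cj, xj in zip(c, x)) < rhs - 1e-12:
            raise ValueError("aggregated constraint infeasible, so the LP is infeasible")
        # Step 3: accumulate the slack of every hard constraint, divided by rho.
        for i in range(n):
            m[i] += (sum(A[i][j] * x[j] for j in range(d)) - b[i]) / rho
        x_sum = [s + xj for s, xj in zip(x_sum, x)]
    return [s / T for s in x_sum]


# Toy instance: x1 + x2 >= 1, x1 - x2 >= -0.5, -x1 + x2 >= -0.5 on the unit box.
A = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]
b = [1.0, -0.5, -0.5]
x_bar = mwu_lp_feasibility(A, b, lo=0.0, hi=1.0, rho=1.5, eps=0.1)
print(x_bar)
```

Each individual iterate is a corner of the box and may violate some hard constraints badly; it is only the average over all $T$ iterates that satisfies every constraint up to $\epsilon$.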

We now describe a two-party distributed protocol for linear programming adapted from this scheme.
The protocol is asymmetric. Player $A$ finds feasible values of $x$ and player $B$ maintains the weights
$m_i$. Specifically, player $A$ constructs a feasible set $\mathcal{T}$ consisting of the original feasible
set $P$ and all of its own constraints. As above, $B$ initializes a weight vector $m$ to all zeros, and then
sends over the single constraint $\sum_i p_i A_i x \ge \sum_i p_i b_i$ to $A$. Player $A$ then finds a
feasible $x$ using this constraint as well as $\mathcal{T}$ (solving a linear program) and then sends the
resulting $x$ back to $B$, who updates its weight vector $m$.

Each round of communication requires $O(d)$ words of information, and there are
$O(\rho^2 \ln n/\epsilon^2)$ rounds of communication. Notice that this is exponentially better than merely
sending over all constraints.

Theorem 7.2. There is a 2-player distributed protocol that uses $O(d \rho^2 \ln n/\epsilon^2)$ words of
communication to compute a soft-$\epsilon$-approximation for a linear program.
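The message flow of the two-party protocol, and its word count, can be illustrated with a small simulation. Everything named here (`PlayerA`, `PlayerB`, `run_protocol`) is hypothetical. To keep the sketch free of an LP solver, it assumes that player A's own constraints are just the box bounds `lo <= x_j <= hi`, that slacks are divided by the width `rho`, and that the objective constraint is omitted.

```python
import math
from typing import List, Tuple


class PlayerB:
    """Holds the hard constraints A x >= b and maintains the weight vector m."""

    def __init__(self, A: List[List[float]], b: List[float], rho: float, eps: float):
        self.A, self.b, self.rho = A, b, rho
        self.eta = eps / (2.0 * rho)
        self.m = [0.0] * len(A)

    def aggregated_constraint(self) -> Tuple[List[float], float]:
        """Collapse all rows into one constraint c . x >= rhs (d + 1 words)."""
        p = [math.exp(-self.eta * mi) for mi in self.m]
        d = len(self.A[0])
        c = [sum(pi * row[j] for pi, row in zip(p, self.A)) for j in range(d)]
        rhs = sum(pi * bi for pi, bi in zip(p, self.b))
        return c, rhs

    def update(self, x: List[float]) -> None:
        for i, (row, bi) in enumerate(zip(self.A, self.b)):
            self.m[i] += (sum(a * xj for a, xj in zip(row, x)) - bi) / self.rho


class PlayerA:
    """Holds the box constraints and answers each aggregated constraint."""

    def __init__(self, lo: float, hi: float):
        self.lo, self.hi = lo, hi

    def respond(self, c: List[float], rhs: float) -> List[float]:
        x = [self.hi if cj > 0 else self.lo for cj in c]   # best point in the box
        if sum(cj * xj for cj, xj in zip(c, x)) < rhs - 1e-12:
            raise ValueError("no feasible point: the LP is infeasible")
        return x


def run_protocol(a: PlayerA, b: PlayerB, n: int, d: int,
                 rho: float, eps: float) -> Tuple[List[float], int, int]:
    T = max(1, math.ceil(4.0 * rho ** 2 * math.log(n) / eps ** 2))
    words, x_sum = 0, [0.0] * d
    for _ in range(T):
        c, rhs = b.aggregated_constraint()
        words += d + 1                      # B -> A: one constraint
        x = a.respond(c, rhs)
        words += d                          # A -> B: one point
        b.update(x)
        x_sum = [s + xj for s, xj in zip(x_sum, x)]
    return [s / T for s in x_sum], words, T


rows = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]]
rhs_vec = [1.0, -0.5, -0.5]
x_bar, words, T = run_protocol(PlayerA(0.0, 1.0), PlayerB(rows, rhs_vec, 1.5, 0.1),
                               n=3, d=2, rho=1.5, eps=0.1)
print(x_bar, words, T)
```

In this sketch every round costs $2d + 1$ words regardless of how many constraints player B holds, which is where the $O(d)$-per-round figure comes from.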
A similar result applies for semidefinite programming (based on an existing primal MWU-based SDP
algorithm (Arora et al., 2005b)) as well as other optimizations for which the MWU applies, such as rank
minimization (Meka et al., 2008), etc.



8 Conclusion 

In this work, we have proposed a simple and efficient protocol that learns an $\epsilon$-optimal distributed
classifier for hyperplanes in arbitrary dimensions. The protocol also gracefully extends to $k$ players. Our
proposed technique WeightedSampling relates to the MWU-based meta framework and we exploit this connection
to extend WeightedSampling for distributed convex optimization problems. This makes our protocol
applicable to a wide variety of distributed learning problems that can be formulated as an optimization
task over multiple distributed nodes.